216036B - Fonseka K.N.N
# Install required packages for big data processing
!pip install pyspark
!pip install plotly
!pip install seaborn
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install imbalanced-learn
!pip install gdown
Requirement already satisfied: pyspark, plotly, seaborn, pandas, numpy, scikit-learn, imbalanced-learn, gdown (and their dependencies; full pip output truncated)
# Core data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import to_hex
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import folium
import missingno as msno

# Utilities
import os
import time
import base64
import gdown
import psutil
from io import BytesIO
from datetime import datetime
from pprint import pprint
import warnings
warnings.filterwarnings('ignore')

# Text and image processing
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Scikit-learn / SciPy / Prophet: preprocessing, modeling, evaluation
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             classification_report, confusion_matrix)
from scipy.stats import linregress
from prophet import Prophet

# PySpark imports for big data processing
# (note: KMeans, PCA, StandardScaler, LogisticRegression and RandomForestClassifier
# below shadow their scikit-learn counterparts imported above)
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import *  # noqa: F403 (kept from original; later cells may rely on it)
from pyspark.sql.functions import col, count, when, min as spark_min, max as spark_max
from pyspark.sql.types import *  # noqa: F403
from pyspark.sql.types import IntegerType, DoubleType
from pyspark.sql.window import Window
from pyspark.ml import Pipeline
from pyspark.ml.feature import (StringIndexer, OneHotEncoder, VectorAssembler,
                                StandardScaler, Bucketizer, PCA)
from pyspark.ml.clustering import KMeans
from pyspark.ml.classification import (LogisticRegression, RandomForestClassifier,
                                       GBTClassifier)
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator,
                                   ClusteringEvaluator)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Create Spark session
spark = SparkSession.builder \
    .appName("GTD Big Data Analysis") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()
To carry out this big data analytics assignment, I imported libraries spanning data manipulation, preprocessing, visualization, machine learning, clustering, forecasting, and natural language processing. Core libraries (pandas, numpy, matplotlib, seaborn) handle data wrangling, numerical computation, and statistical visualization, while PySpark provides distributed processing for the large dataset. For advanced and interactive visualization, plotly, folium, and missingno produce charts, geographic maps, and missing-value patterns. Clustering and dimensionality reduction use K-Means and PCA, and time-series forecasting uses Facebook's Prophet. For textual analysis, CountVectorizer, wordcloud, and PIL enable the extraction and representation of key themes from incident summaries. Predictive models are built with RandomForestClassifier, GradientBoostingClassifier, and LogisticRegression; StandardScaler, LabelEncoder, and train_test_split prepare the data for modeling, while confusion_matrix and classification_report support model evaluation.
A Spark session was then initialized with Colab-oriented memory settings (4 GB each for the driver and executor) to enable distributed computation, feature engineering, and model building, so that both exploratory and predictive analyses could be conducted effectively on the global terrorism dataset.
import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)
# ---------------- Performance Monitoring ----------------
def monitor_loading_performance():
    """Snapshot the driver process's current memory and CPU usage."""
    process = psutil.Process()
    return {
        'memory_mb': process.memory_info().rss / 1024**2,
        'cpu_percent': psutil.cpu_percent(interval=1),
        'timestamp': datetime.now()
    }
# Dataset path in Colab
dataset_path = "/content/Dataset_BigData.csv"

# ---------------- Check File ----------------
if os.path.exists(dataset_path):
    file_size_mb = os.path.getsize(dataset_path) / (1024**2)
    print(f"\nDataset found: {dataset_path} ({file_size_mb:.2f} MB)")
else:
    raise FileNotFoundError(f"Dataset not found at {dataset_path}!")

# ---------------- Pandas Loading ----------------
print("\nLoading with Pandas...")
start_time = time.time()
initial_memory = monitor_loading_performance()

# Load into raw_gtd_df instead of df_pandas
raw_gtd_df = pd.read_csv(dataset_path, encoding='ISO-8859-1')

pandas_load_time = time.time() - start_time
pandas_memory = monitor_loading_performance()
print(f"Pandas loaded in {pandas_load_time:.2f}s | Memory: {pandas_memory['memory_mb']:.1f} MB | Shape: {raw_gtd_df.shape}")

# ---------------- Spark Loading ----------------
print("\nLoading with Spark...")
spark_start = time.time()
df_spark = spark.read.csv(dataset_path, header=True, inferSchema=True)
df_spark.cache()
spark_count = df_spark.count()  # action forces the read and populates the cache
spark_load_time = time.time() - spark_start
spark_memory = monitor_loading_performance()
print(f"Spark loaded in {spark_load_time:.2f}s | Memory: {spark_memory['memory_mb']:.1f} MB | Records: {spark_count:,} | Partitions: {df_spark.rdd.getNumPartitions()}")

# ---------------- Performance Comparison ----------------
print("\nLOADING PERFORMANCE ANALYSIS:")
print(f"  Pandas: {pandas_load_time:.2f}s")
print(f"  Spark:  {spark_load_time:.2f}s")
if spark_load_time < pandas_load_time:
    print(f"  Spark is {pandas_load_time / spark_load_time:.2f}x faster than Pandas")
else:
    print(f"  Pandas is {spark_load_time / pandas_load_time:.2f}x faster than Spark")
Dataset found: /content/Dataset_BigData.csv (155.27 MB)
Loading with Pandas...
Pandas loaded in 5.74s | Memory: 1972.4 MB | Shape: (181691, 135)
Loading with Spark...
Spark loaded in 5.01s | Memory: 1972.4 MB | Records: 181,691 | Partitions: 2
LOADING PERFORMANCE ANALYSIS:
Pandas: 5.74s
Spark: 5.01s
Spark is 1.15x faster than Pandas
The dataset was loaded with both Pandas and PySpark to compare performance and resource utilization. Pandas read the 155 MB file containing 181,691 records and 135 features in 5.74 seconds, with driver-process memory at roughly 1,972 MB afterwards. PySpark loaded and cached the same dataset in 5.01 seconds across 2 partitions, making it about 1.15 times faster here. Note that both memory readings come from the same driver process, which is why they are identical: they reflect cumulative process usage, not each library's footprint in isolation. The advantage of Spark would grow on datasets too large for single-machine memory, where its distributed, scalable processing becomes essential.
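A caveat on the comparison: a single timed run on a shared Colab VM is noisy, so a 1.15x gap may not be stable. A hedged sketch of averaging over repeats (the `time_call` helper is hypothetical; for a fair Spark re-run the cache would also need to be cleared between iterations):

```python
import time

def time_call(fn, repeats=3):
    # Hypothetical helper: average wall-clock time over several runs,
    # since single-run load timings on a shared VM fluctuate.
    elapsed = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        elapsed.append(time.perf_counter() - t0)
    return sum(elapsed) / len(elapsed)

# Usage sketch (each call re-reads the file from disk):
# avg_pandas = time_call(lambda: pd.read_csv(dataset_path, encoding='ISO-8859-1'))
```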
import chardet

# Read a small sample of the file to guess the encoding
with open(dataset_path, 'rb') as f:
    rawdata = f.read(10000)  # read first 10 KB

result = chardet.detect(rawdata)
print(result)
{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
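Since chardet reports only 0.73 confidence, it can be worth guarding the detection with a fallback before passing it to a reader. A small sketch (the `choose_encoding` helper and its 0.5 threshold are illustrative assumptions, not part of chardet's API):

```python
def choose_encoding(detection, default="ISO-8859-1", min_confidence=0.5):
    # Hypothetical helper: accept chardet's guess only when it is
    # reasonably confident; otherwise fall back to a known-safe default.
    enc = detection.get("encoding")
    conf = detection.get("confidence", 0)
    if enc and conf >= min_confidence:
        return enc
    return default
```

Here `choose_encoding(result)` returns `'ISO-8859-1'`, matching the encoding used for both the Pandas and Spark loads.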
# Load the raw data
# (note: this rebinds raw_gtd_df, previously a pandas DataFrame, to a Spark DataFrame)
raw_gtd_df = spark.read.csv(
    dataset_path,
    header=True,
    inferSchema=True,
    encoding='ISO-8859-1'
)
This block loads the Global Terrorism Database CSV file into a Spark DataFrame using the specified file path and appropriate character encoding to handle special characters.
# Number of rows
num_rows = raw_gtd_df.count()
# Number of columns
num_cols = len(raw_gtd_df.columns)
print(f"Shape: ({num_rows}, {num_cols})")
Shape: (181691, 135)
These lines compute the shape of the dataset, showing that it contains 181,691 rows (incidents) and 135 columns (features).
# List all feature names / columns
feature_names = raw_gtd_df.columns
print(feature_names)
['eventid', 'iyear', 'imonth', 'iday', 'approxdate', 'extended', 'resolution', 'country', 'country_txt', 'region', 'region_txt', 'provstate', 'city', 'latitude', 'longitude', 'specificity', 'vicinity', 'location', 'summary', 'crit1', 'crit2', 'crit3', 'doubtterr', 'alternative', 'alternative_txt', 'multiple', 'success', 'suicide', 'attacktype1', 'attacktype1_txt', 'attacktype2', 'attacktype2_txt', 'attacktype3', 'attacktype3_txt', 'targtype1', 'targtype1_txt', 'targsubtype1', 'targsubtype1_txt', 'corp1', 'target1', 'natlty1', 'natlty1_txt', 'targtype2', 'targtype2_txt', 'targsubtype2', 'targsubtype2_txt', 'corp2', 'target2', 'natlty2', 'natlty2_txt', 'targtype3', 'targtype3_txt', 'targsubtype3', 'targsubtype3_txt', 'corp3', 'target3', 'natlty3', 'natlty3_txt', 'gname', 'gsubname', 'gname2', 'gsubname2', 'gname3', 'gsubname3', 'motive', 'guncertain1', 'guncertain2', 'guncertain3', 'individual', 'nperps', 'nperpcap', 'claimed', 'claimmode', 'claimmode_txt', 'claim2', 'claimmode2', 'claimmode2_txt', 'claim3', 'claimmode3', 'claimmode3_txt', 'compclaim', 'weaptype1', 'weaptype1_txt', 'weapsubtype1', 'weapsubtype1_txt', 'weaptype2', 'weaptype2_txt', 'weapsubtype2', 'weapsubtype2_txt', 'weaptype3', 'weaptype3_txt', 'weapsubtype3', 'weapsubtype3_txt', 'weaptype4', 'weaptype4_txt', 'weapsubtype4', 'weapsubtype4_txt', 'weapdetail', 'nkill', 'nkillus', 'nkillter', 'nwound', 'nwoundus', 'nwoundte', 'property', 'propextent', 'propextent_txt', 'propvalue', 'propcomment', 'ishostkid', 'nhostkid', 'nhostkidus', 'nhours', 'ndays', 'divert', 'kidhijcountry', 'ransom', 'ransomamt', 'ransomamtus', 'ransompaid', 'ransompaidus', 'ransomnote', 'hostkidoutcome', 'hostkidoutcome_txt', 'nreleased', 'addnotes', 'scite1', 'scite2', 'scite3', 'dbsource', 'INT_LOG', 'INT_IDEO', 'INT_MISC', 'INT_ANY', 'related']
This command lists all the column names in the dataset, allowing the user to understand the available variables and plan the analysis accordingly.
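With 135 columns, it helps to bucket the names by the dataset's own conventions; for instance, the `_txt` columns are human-readable labels for coded fields (e.g. `country_txt` labels `country`). A small illustration on a subset of the names listed above:

```python
# Subset of the GTD column names listed above (for illustration)
feature_names = ['eventid', 'iyear', 'country', 'country_txt',
                 'attacktype1', 'attacktype1_txt', 'nkill', 'nwound']

# Human-readable label columns end in '_txt'; their coded twins drop the suffix
txt_cols = [c for c in feature_names if c.endswith('_txt')]
coded_twins = [c[:-len('_txt')] for c in txt_cols]

print(txt_cols)     # ['country_txt', 'attacktype1_txt']
print(coded_twins)  # ['country', 'attacktype1']
```

The same list comprehension applied to the full `raw_gtd_df.columns` gives a quick map of which coded fields have text counterparts.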
# Show first 5 rows
raw_gtd_df.show(5)
(Output truncated: `show(5)` prints all 135 columns as a very wide ASCII table. The first five records are 1970 incidents in the Dominican Republic, Mexico, the Philippines, Greece, and Japan, with most optional fields NULL. Only the top 5 rows are shown.)
# Get last 5 rows
# (tail() avoids collecting the entire dataset to the driver, unlike collect()[-5:])
tail_rows = raw_gtd_df.tail(5)
for row in tail_rows:
    print(row)
(Output truncated: the last five records are December 2017 incidents in Somalia, Syria, and the Philippines, attributed to Al-Shabaab, Muslim extremists, and the Bangsamoro Islamic Freedom Movement (BIFM). Note that in some rows the citation text from `scite1`-`scite3` spills over into `dbsource` and the `INT_*` columns, indicating that quoted, multiline CSV fields were not parsed correctly with the default reader options.)
ransompaid=None, ransompaidus=None, ransomnote=None, hostkidoutcome=None, hostkidoutcome_txt=None, nreleased=None, addnotes=None, scite1='"""Maguindanao clashes trap tribe members', scite2='"" Philippines Daily Inquirer', scite3=' January 3', dbsource=' 2018."', INT_LOG=None, INT_IDEO=None, INT_MISC='START Primary Collection', INT_ANY='0', related='0') Row(eventid=201712310031, iyear=2017, imonth=12, iday=31, approxdate=None, extended=0, resolution=None, country=92, country_txt='India', region=6, region_txt='South Asia', provstate='Manipur', city='Imphal', latitude=24.798346, longitude=93.94043, specificity=1, vicinity=0, location='The incident occurred in the Mantripukhri neighborhood.', summary='12/31/2017: Assailants threw a grenade at a Forest Department office in Mantripukhri neighborhood, Imphal, Manipur, India. No casualties were reported in the blast. No group claimed responsibility for the incident.', crit1='1', crit2='1', crit3='1', doubtterr='0', alternative=None, alternative_txt=None, multiple='0', success='0', suicide='0', attacktype1='3', attacktype1_txt='Bombing/Explosion', attacktype2=None, attacktype2_txt=None, attacktype3=None, attacktype3_txt=None, targtype1='2', targtype1_txt='Government (General)', targsubtype1='21', targsubtype1_txt='Government Building/Facility/Office', corp1='Forest Department Manipur', target1='Office', natlty1='92', natlty1_txt='India', targtype2=None, targtype2_txt=None, targsubtype2=None, targsubtype2_txt=None, corp2=None, target2=None, natlty2=None, natlty2_txt=None, targtype3=None, targtype3_txt=None, targsubtype3=None, targsubtype3_txt=None, corp3=None, target3=None, natlty3=None, natlty3_txt=None, gname='Unknown', gsubname=None, gname2=None, gsubname2=None, gname3=None, gsubname3=None, motive=None, guncertain1='0', guncertain2=None, guncertain3=None, individual='0', nperps='-99', nperpcap='0', claimed='0', claimmode=None, claimmode_txt=None, claim2=None, claimmode2=None, claimmode2_txt=None, claim3=None, 
claimmode3=None, claimmode3_txt=None, compclaim=None, weaptype1='6', weaptype1_txt='Explosives', weapsubtype1='7', weapsubtype1_txt='Grenade', weaptype2=None, weaptype2_txt=None, weapsubtype2=None, weapsubtype2_txt=None, weaptype3=None, weaptype3_txt=None, weapsubtype3=None, weapsubtype3_txt=None, weaptype4=None, weaptype4_txt=None, weapsubtype4=None, weapsubtype4_txt=None, weapdetail='A thrown grenade was used in the attack.', nkill='0', nkillus='0', nkillter='0', nwound='0', nwoundus='0', nwoundte='0', property='-9', propextent=None, propextent_txt=None, propvalue=None, propcomment=None, ishostkid='0', nhostkid=None, nhostkidus=None, nhours=None, ndays=None, divert=None, kidhijcountry=None, ransom=None, ransomamt=None, ransomamtus=None, ransompaid=None, ransompaidus=None, ransomnote=None, hostkidoutcome=None, hostkidoutcome_txt=None, nreleased=None, addnotes=None, scite1='"""Trader escapes grenade attack in Imphal', scite2='"" Business Standard India', scite3=' January 3', dbsource=' 2018."', INT_LOG=None, INT_IDEO=None, INT_MISC='START Primary Collection', INT_ANY='-9', related='-9') Row(eventid=201712310032, iyear=2017, imonth=12, iday=31, approxdate=None, extended=0, resolution=None, country=160, country_txt='Philippines', region=5, region_txt='Southeast Asia', provstate='Maguindanao', city='Cotabato City', latitude=7.209594, longitude=124.241966, specificity=1, vicinity=0, location=None, summary='12/31/2017: An explosive device was discovered and defused at a plaza in Cotabato City, Maguindanao, Philippines. 
No group claimed responsibility for the incident.', crit1='1', crit2='1', crit3='1', doubtterr='0', alternative=None, alternative_txt=None, multiple='0', success='0', suicide='0', attacktype1='3', attacktype1_txt='Bombing/Explosion', attacktype2=None, attacktype2_txt=None, attacktype3=None, attacktype3_txt=None, targtype1='20', targtype1_txt='Unknown', targsubtype1=None, targsubtype1_txt=None, corp1='Unknown', target1='Unknown', natlty1='160', natlty1_txt='Philippines', targtype2=None, targtype2_txt=None, targsubtype2=None, targsubtype2_txt=None, corp2=None, target2=None, natlty2=None, natlty2_txt=None, targtype3=None, targtype3_txt=None, targsubtype3=None, targsubtype3_txt=None, corp3=None, target3=None, natlty3=None, natlty3_txt=None, gname='Unknown', gsubname=None, gname2=None, gsubname2=None, gname3=None, gsubname3=None, motive=None, guncertain1='0', guncertain2=None, guncertain3=None, individual='0', nperps='-99', nperpcap='0', claimed='0', claimmode=None, claimmode_txt=None, claim2=None, claimmode2=None, claimmode2_txt=None, claim3=None, claimmode3=None, claimmode3_txt=None, compclaim=None, weaptype1='6', weaptype1_txt='Explosives', weapsubtype1='16', weapsubtype1_txt='Unknown Explosive Type', weaptype2=None, weaptype2_txt=None, weapsubtype2=None, weapsubtype2_txt=None, weaptype3=None, weaptype3_txt=None, weapsubtype3=None, weapsubtype3_txt=None, weaptype4=None, weaptype4_txt=None, weapsubtype4=None, weapsubtype4_txt=None, weapdetail='An explosive device containing a detonating cord, a battery, and a blasting cap was used in the attack.', nkill='0', nkillus='0', nkillter='0', nwound='0', nwoundus='0', nwoundte='0', property='0', propextent=None, propextent_txt=None, propvalue=None, propcomment=None, ishostkid='0', nhostkid=None, nhostkidus=None, nhours=None, ndays=None, divert=None, kidhijcountry=None, ransom=None, ransomamt=None, ransomamtus=None, ransompaid=None, ransompaidus=None, ransomnote=None, hostkidoutcome=None, hostkidoutcome_txt=None, 
nreleased=None, addnotes=None, scite1='"""Security tightened in Cotabato following IED discovery', scite2='"" Tempo', scite3=' January 4', dbsource=' 2018."', INT_LOG='"""Security tightened in Cotabato City', INT_IDEO='"" Manila Bulletin', INT_MISC=' January 3', INT_ANY=' 2018."', related=None)
from pyspark.sql import functions as F
# Select required columns
gtd_df = raw_gtd_df.select(
    'iyear', 'imonth', 'iday', 'region_txt', 'country_txt', 'provstate',
    'latitude', 'longitude', 'success', 'attacktype1_txt', 'targtype1_txt',
    'target1', 'weaptype1_txt', 'gname', 'suicide', 'nkill', 'nwound',
    'nkillter', 'summary', 'motive', 'propextent', 'dbsource'
)
# Rename columns for readability
gtd_df = gtd_df.withColumnRenamed('iyear', 'year') \
    .withColumnRenamed('imonth', 'month') \
    .withColumnRenamed('iday', 'day') \
    .withColumnRenamed('region_txt', 'region') \
    .withColumnRenamed('country_txt', 'country') \
    .withColumnRenamed('provstate', 'province') \
    .withColumnRenamed('attacktype1_txt', 'attack_type') \
    .withColumnRenamed('targtype1_txt', 'target_type') \
    .withColumnRenamed('target1', 'target') \
    .withColumnRenamed('weaptype1_txt', 'weapon_type') \
    .withColumnRenamed('gname', 'terror_group') \
    .withColumnRenamed('nkill', 'killed') \
    .withColumnRenamed('nwound', 'wounded') \
    .withColumnRenamed('nkillter', 'perpetrator_kill')
rows = gtd_df.count()
cols = len(gtd_df.columns)
print(f"Shape: ({rows}, {cols})")
Shape: (181691, 22)
I then streamlined the dataset by dropping irrelevant or redundant columns and retaining only the features most relevant to analysing terrorist incidents. Specifically, I selected 22 of the original 135 columns, covering the date and location of each attack (year, month, day, region, country, province, latitude, longitude), attack characteristics (success, attack_type, weapon_type, suicide), target and perpetrator details (target_type, target, terror_group, perpetrator_kill), casualty counts (killed, wounded), and contextual information (summary, motive, propextent, dbsource). I also renamed these columns to more readable, intuitive names, which improves the clarity of later analysis. The result is a dataset of 181,691 rows and 22 columns containing the information most relevant to in-depth terrorism data analysis.
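The long chain of withColumnRenamed calls can also be driven from a single mapping. Below is a minimal sketch of that pattern; the RENAMES dict (a subset of the actual renames), the apply_renames helper, and the FakeDF stand-in are illustrative only, not code from this notebook. apply_renames works on any object exposing a withColumnRenamed method, including a real Spark DataFrame:

```python
from functools import reduce

# Illustrative mapping from raw GTD names to readable names (subset only).
RENAMES = {
    'iyear': 'year', 'imonth': 'month', 'iday': 'day',
    'region_txt': 'region', 'country_txt': 'country',
    'gname': 'terror_group', 'nkill': 'killed', 'nwound': 'wounded',
}

def apply_renames(df, renames):
    """Fold withColumnRenamed over the mapping instead of chaining by hand."""
    return reduce(lambda d, kv: d.withColumnRenamed(*kv), renames.items(), df)

# Tiny stand-in for a DataFrame so the sketch runs without a Spark session.
class FakeDF:
    def __init__(self, columns):
        self.columns = list(columns)
    def withColumnRenamed(self, old, new):
        return FakeDF(new if c == old else c for c in self.columns)
```

Renames not present in a given DataFrame are simply no-ops, so the same mapping can be reused across slightly different column subsets.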
# Check the new column names
print(gtd_df.columns)
['year', 'month', 'day', 'region', 'country', 'province', 'latitude', 'longitude', 'success', 'attack_type', 'target_type', 'target', 'weapon_type', 'terror_group', 'suicide', 'killed', 'wounded', 'perpetrator_kill', 'summary', 'motive', 'propextent', 'dbsource']
# Show first 5 rows
gtd_df.show(5)
+----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+ |year|month|day| region| country|province| latitude| longitude|success| attack_type| target_type| target|weapon_type| terror_group|suicide|killed|wounded|perpetrator_kill|summary|motive|propextent|dbsource| +----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+ |1970| 7| 2|Central America &...|Dominican Republic| NULL|18.456792|-69.951164| 1| Assassination|Private Citizens ...| Julio Guzman| Unknown| MANO-D| 0| 1| 0| NULL| NULL| NULL| NULL| PGIS| |1970| 0| 0| North America| Mexico| Federal|19.371887|-99.086624| 1|Hostage Taking (K...|Government (Diplo...|Nadine Chaval, da...| Unknown|23rd of September...| 0| 0| 0| NULL| NULL| NULL| NULL| PGIS| |1970| 1| 0| Southeast Asia| Philippines| Tarlac|15.478598|120.599741| 1| Assassination| Journalists & Media| Employee| Unknown| Unknown| 0| 1| 0| NULL| NULL| NULL| NULL| PGIS| |1970| 1| 0| Western Europe| Greece| Attica| 37.99749| 23.762728| 1| Bombing/Explosion|Government (Diplo...| U.S. Embassy| Explosives| Unknown| 0| NULL| NULL| NULL| NULL| NULL| NULL| PGIS| |1970| 1| 0| East Asia| Japan| Fukouka|33.580412|130.396361| 1|Facility/Infrastr...|Government (Diplo...| U.S. Consulate| Incendiary| Unknown| 0| NULL| NULL| NULL| NULL| NULL| NULL| PGIS| +----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+ only showing top 5 rows
# Parse Dates
gtd_df = gtd_df.withColumn(
    'date',
    F.to_date(F.concat_ws('-', F.col('year'), F.col('month'), F.col('day')), 'yyyy-M-d')
)
# Add casualties column
gtd_df = gtd_df.withColumn(
    'casualties',
    F.coalesce(F.col('killed'), F.lit(0)) + F.coalesce(F.col('wounded'), F.lit(0))
)
rows = gtd_df.count()
cols = len(gtd_df.columns)
print(f"Shape: ({rows}, {cols})")
Shape: (181691, 24)
# Show first 5 rows
gtd_df.show(5)
+----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+----------+----------+ |year|month|day| region| country|province| latitude| longitude|success| attack_type| target_type| target|weapon_type| terror_group|suicide|killed|wounded|perpetrator_kill|summary|motive|propextent|dbsource| date|casualties| +----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+----------+----------+ |1970| 7| 2|Central America &...|Dominican Republic| NULL|18.456792|-69.951164| 1| Assassination|Private Citizens ...| Julio Guzman| Unknown| MANO-D| 0| 1| 0| NULL| NULL| NULL| NULL| PGIS|1970-07-02| 1.0| |1970| 0| 0| North America| Mexico| Federal|19.371887|-99.086624| 1|Hostage Taking (K...|Government (Diplo...|Nadine Chaval, da...| Unknown|23rd of September...| 0| 0| 0| NULL| NULL| NULL| NULL| PGIS| NULL| 0.0| |1970| 1| 0| Southeast Asia| Philippines| Tarlac|15.478598|120.599741| 1| Assassination| Journalists & Media| Employee| Unknown| Unknown| 0| 1| 0| NULL| NULL| NULL| NULL| PGIS| NULL| 1.0| |1970| 1| 0| Western Europe| Greece| Attica| 37.99749| 23.762728| 1| Bombing/Explosion|Government (Diplo...| U.S. Embassy| Explosives| Unknown| 0| NULL| NULL| NULL| NULL| NULL| NULL| PGIS| NULL| 0.0| |1970| 1| 0| East Asia| Japan| Fukouka|33.580412|130.396361| 1|Facility/Infrastr...|Government (Diplo...| U.S. 
Consulate| Incendiary| Unknown| 0| NULL| NULL| NULL| NULL| NULL| NULL| PGIS| NULL| 0.0| +----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+----------+----------+ only showing top 5 rows
For feature engineering, I added a new casualties column that sums the number killed and wounded in each incident, giving a more comprehensive measure of an attack's total human impact. This variable is central to later severity analysis, clustering, and risk scoring, as it captures the full extent of harm caused. Additionally, I combined the separate year, month, and day columns into a single date column using F.to_date, enabling easier time-series analysis, trend forecasting, and chronological filtering; rows whose month or day is recorded as 0 cannot form a valid date and therefore end up NULL. These feature engineering steps create new, analytically valuable variables and transform the dataset into a format better suited to visualization, modeling, and insightful interpretation, increasing the number of columns from 22 to 24.
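The behaviour of these two derived columns can be illustrated in plain Python. This is a hedged sketch: parse_gtd_date and total_casualties are illustrative helpers, not Spark code, but they mirror why rows with imonth or iday recorded as 0 get a NULL date while casualties still computes:

```python
from datetime import date

def parse_gtd_date(year, month, day):
    """Mimic F.to_date on 'yyyy-M-d': an invalid component (e.g. month=0)
    cannot form a date, so the result is None (NULL in Spark)."""
    try:
        return date(year, month, day)
    except ValueError:
        return None

def total_casualties(killed, wounded):
    """Mimic coalesce(killed, 0) + coalesce(wounded, 0): missing counts
    are treated as zero rather than propagating NULL into the sum."""
    return (0 if killed is None else killed) + (0 if wounded is None else wounded)
```

This matches the sample output: the 1970-07-02 Dominican Republic row parses cleanly, while the Mexico row with month=0, day=0 gets a NULL date but still a casualties value.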
# Method 1: Print schema nicely
gtd_df.printSchema()
# Method 2: Get a list of column names and their types
gtd_df.dtypes
root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- region: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- success: string (nullable = true)
 |-- attack_type: string (nullable = true)
 |-- target_type: string (nullable = true)
 |-- target: string (nullable = true)
 |-- weapon_type: string (nullable = true)
 |-- terror_group: string (nullable = true)
 |-- suicide: string (nullable = true)
 |-- killed: string (nullable = true)
 |-- wounded: string (nullable = true)
 |-- perpetrator_kill: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- motive: string (nullable = true)
 |-- propextent: string (nullable = true)
 |-- dbsource: string (nullable = true)
 |-- date: date (nullable = true)
 |-- casualties: double (nullable = true)
[('year', 'int'),
('month', 'int'),
('day', 'int'),
('region', 'string'),
('country', 'string'),
('province', 'string'),
('latitude', 'double'),
('longitude', 'double'),
('success', 'string'),
('attack_type', 'string'),
('target_type', 'string'),
('target', 'string'),
('weapon_type', 'string'),
('terror_group', 'string'),
('suicide', 'string'),
('killed', 'string'),
('wounded', 'string'),
('perpetrator_kill', 'string'),
('summary', 'string'),
('motive', 'string'),
('propextent', 'string'),
('dbsource', 'string'),
('date', 'date'),
('casualties', 'double')]
from pyspark.sql import functions as F
numerical_vars = ['killed', 'wounded', 'casualties', 'suicide', 'year', 'latitude', 'longitude']
# 1. Count, mean, stddev
summary_stats = gtd_df.select(numerical_vars).describe().filter(F.col("summary").isin("count", "mean", "stddev"))
summary_stats.show(truncate=False)
# 2. Median
median_df = gtd_df.select([F.expr(f'percentile_approx({c}, 0.5)').alias(c) for c in numerical_vars])
median_df.show(truncate=False)
# 3. Mode
mode_dict = {}
for c in numerical_vars:
    mode_val = gtd_df.groupBy(c).count().orderBy(F.desc('count')).first()[0]
    mode_dict[c] = mode_val

print("Mode values:")
for k, v in mode_dict.items():
    print(f"{k}: {v}")
+-------+------------------+------------------+-----------------+-------------------+------------------+------------------+------------------+
|summary|killed            |wounded           |casualties       |suicide            |year              |latitude          |longitude         |
+-------+------------------+------------------+-----------------+-------------------+------------------+------------------+------------------+
|count  |170794            |165291            |181476           |181606             |181691            |177135            |177134            |
|mean   |2.4038225968104   |3.1526405286503563|5.130154951618947|0.03772992620332636|2002.6389969783863|23.49834295928318 |-458.6956530247027|
|stddev |11.554775970212424|35.9396365074308  |40.55104567528549|0.19091822119542906|13.259430466246835|18.569242421025763|204778.9886113944 |
+-------+------------------+------------------+-----------------+-------------------+------------------+------------------+------------------+

+------+-------+----------+-------+----+---------+---------+
|killed|wounded|casualties|suicide|year|latitude |longitude|
+------+-------+----------+-------+----+---------+---------+
|0.0   |0.0    |1.0       |0.0    |2009|31.467463|43.243996|
+------+-------+----------+-------+----+---------+---------+

Mode values:
killed: 0
wounded: 0
casualties: 0.0
suicide: 0
year: 2014
latitude: 33.303566
longitude: 44.371773
An initial exploratory analysis was conducted on the numerical variables of the Global Terrorism Database, including killed, wounded, casualties, suicide, year, latitude, and longitude. The dataset contains varying numbers of observations per variable, with counts ranging from 165,291 for wounded to 181,606 for suicide. On average, incidents resulted in approximately 2.4 deaths, 3.15 wounded, and 5.13 total casualties, with a very small proportion (0.038) involving suicide attacks. The standard deviations indicate high variability in casualties and geographic coordinates, reflecting the sporadic and widespread nature of terrorist incidents. Median values show that more than half of the incidents involved zero deaths, zero wounded, and zero suicide attacks, highlighting that many events were minor in scale. Mode analysis confirms that the most frequently reported values were zero for casualties, deaths, and wounded, with the most common year being 2014, and the most common locations concentrated around latitude 33.3 and longitude 44.37. Overall, these statistics provide a foundational understanding of the distribution, central tendency, and variability in the dataset, guiding further analysis.
Now, for data preprocessing, I started with the numerical variables.
To ensure the completeness and reliability of the dataset, missing values were systematically identified and addressed. A custom function was created to quantify both the absolute and relative extent of missing values across columns. Based on this analysis, columns such as motive, propextent, and perpetrator_kill were removed due to having a high proportion of missing data, rendering them unsuitable for analysis. The summary column, primarily textual in nature, had numerous missing entries, which were replaced with the placeholder "Unknown" to retain the structure without introducing bias. For numerical variables like killed and wounded, where missing values were relatively sparse, median imputation was employed to maintain the distribution of the data and prevent skewness that could arise from extreme values.
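The median-imputation step described above is simple to state precisely. Here is a pure-Python sketch of the fill logic (impute_median is an illustrative helper, not the Spark code this notebook runs): the fill value is computed only from the observed entries, so it does not shift the centre of the distribution the way mean imputation can under heavy right skew:

```python
from statistics import median

def impute_median(values):
    """Fill missing (None) entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]
```

With casualty counts, which are zero for most incidents but occasionally huge, the median (usually 0) is a much safer fill than the mean, which is dragged upward by mass-casualty outliers.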
from pyspark.sql.types import IntegerType
# Cast numeric columns that are stored as strings
numeric_cols = ['success', 'suicide', 'killed', 'wounded', 'casualties']
for c in numeric_cols:
    gtd_df = gtd_df.withColumn(c, gtd_df[c].cast(IntegerType()))
from pyspark.sql.functions import col, sum, count, round
# total number of rows
total_rows = gtd_df.count()
# missing values count & percentage for each column
missing_df = gtd_df.select([
sum(col(c).isNull().cast("int")).alias(c + "_missing") for c in gtd_df.columns
])
# reshape into column, missing_count, missing_percent
missing_long = (
missing_df.selectExpr("stack(" + str(len(gtd_df.columns)) + "," +
",".join([f"'{c}', {c}_missing" for c in gtd_df.columns]) +
") as (column_name, missing_count)")
.withColumn("missing_percent", round((col("missing_count")/total_rows)*100, 2))
)
missing_long.orderBy(col("missing_percent").desc()).show(truncate=False)
+----------------+-------------+---------------+
|column_name     |missing_count|missing_percent|
+----------------+-------------+---------------+
|motive          |131567       |72.41          |
|propextent      |117295       |64.56          |
|perpetrator_kill|67146        |36.96          |
|summary         |66129        |36.4           |
|wounded         |16440        |9.05           |
|killed          |11074        |6.09           |
|latitude        |4556         |2.51           |
|longitude       |4557         |2.51           |
|date            |891          |0.49           |
|dbsource        |877          |0.48           |
|target          |682          |0.38           |
|terror_group    |487          |0.27           |
|weapon_type     |436          |0.24           |
|province        |421          |0.23           |
|target_type     |263          |0.14           |
|casualties      |215          |0.12           |
|success         |207          |0.11           |
|suicide         |111          |0.06           |
|attack_type     |35           |0.02           |
|year            |0            |0.0            |
+----------------+-------------+---------------+
only showing top 20 rows
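The stack() reshape above just turns one wide row of per-column counts into (column, count, percent) rows. The same computation in plain Python over a list of dicts (missing_report is an illustrative helper, not notebook code) makes the arithmetic explicit:

```python
def missing_report(rows, columns):
    """Per-column missing count and percentage, like the stacked Spark output."""
    total = len(rows)
    report = {}
    for c in columns:
        # Count records where the column is absent or None
        missing = sum(1 for r in rows if r.get(c) is None)
        report[c] = (missing, round(100.0 * missing / total, 2))
    return report
```

The Spark version does exactly this, but distributes the null-counting across the cluster and uses stack() only to pivot the single result row into long form for display.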
# -----------------------------
# 1. Remove columns with high missing proportion
# -----------------------------
# Columns with a high proportion of missing values (over 35%)
gtd_df = gtd_df.drop('motive', 'propextent', 'perpetrator_kill')
# -----------------------------
# 2. Fill 'summary' column nulls with 'Unknown'
# -----------------------------
gtd_df = gtd_df.withColumn('summary', F.coalesce(F.col('summary'), F.lit('Unknown')))
# -----------------------------
# 3. Fill 'killed' and 'wounded' nulls with median
# -----------------------------
# Calculate medians
median_killed = gtd_df.approxQuantile("killed", [0.5], 0.0)[0]
median_wounded = gtd_df.approxQuantile("wounded", [0.5], 0.0)[0]
gtd_df = gtd_df.withColumn("killed", F.coalesce(F.col("killed"), F.lit(median_killed)))
gtd_df = gtd_df.withColumn("wounded", F.coalesce(F.col("wounded"), F.lit(median_wounded)))
num_rows = gtd_df.count()
num_cols = len(gtd_df.columns)
print(f"Shape: ({num_rows}, {num_cols})")
Shape: (181691, 21)
# Re-check missing values after imputation
total_rows = gtd_df.count()

# Missing-value count per column
missing_df = gtd_df.select([
    sum(col(c).isNull().cast("int")).alias(c + "_missing") for c in gtd_df.columns
])

# Reshape into (column_name, missing_count, missing_percent)
missing_long = (
    missing_df.selectExpr("stack(" + str(len(gtd_df.columns)) + "," +
                          ",".join([f"'{c}', {c}_missing" for c in gtd_df.columns]) +
                          ") as (column_name, missing_count)")
    .withColumn("missing_percent", round((col("missing_count") / total_rows) * 100, 2))
)

missing_long.orderBy(col("missing_percent").desc()).show(truncate=False)
+------------+-------------+---------------+
|column_name |missing_count|missing_percent|
+------------+-------------+---------------+
|latitude    |4556         |2.51           |
|longitude   |4557         |2.51           |
|date        |891          |0.49           |
|dbsource    |877          |0.48           |
|target      |682          |0.38           |
|terror_group|487          |0.27           |
|weapon_type |436          |0.24           |
|province    |421          |0.23           |
|target_type |263          |0.14           |
|casualties  |215          |0.12           |
|success     |207          |0.11           |
|suicide     |111          |0.06           |
|attack_type |35           |0.02           |
|year        |0            |0.0            |
|month       |0            |0.0            |
|day         |0            |0.0            |
|region      |0            |0.0            |
|country     |0            |0.0            |
|killed      |0            |0.0            |
|wounded     |0            |0.0            |
+------------+-------------+---------------+
only showing top 20 rows
import matplotlib.pyplot as plt
import pandas as pd
from pyspark.sql.functions import col, sum
# Total rows
total_rows = gtd_df.count()
# Compute missing values per column
missing_df = gtd_df.select([sum(col(c).isNull().cast("int")).alias(c) for c in gtd_df.columns])
# Convert to Pandas for visualization
missing_pd = missing_df.toPandas().T.reset_index()
missing_pd.columns = ['column', 'missing_count']
missing_pd['missing_percent'] = (missing_pd['missing_count'] / total_rows) * 100
# Plot using Matplotlib
plt.figure(figsize=(10,6))
plt.bar(missing_pd['column'], missing_pd['missing_percent'], color=(0.75,0.75,0.475))
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.ylabel('% of Missing Values', fontsize=12)
plt.title('Missing Values per Column', fontsize=14)
plt.show()
# Visualise where exactly the missing values are
import missingno as msno  # requires the missingno package (pip install missingno)
# msno.matrix expects a pandas DataFrame, so convert the Spark DataFrame first
msno.matrix(gtd_df.toPandas(), figsize=(10, 6), fontsize=12, color=(0.75, 0.50, 0.25))
<Axes: >
Duplicate records can bias analysis, especially frequency-based or aggregate computations. A check revealed 3,170 distinct records that appear more than once in the dataset. All redundant copies were removed with PySpark's dropDuplicates(), which reduced the record count from 181,691 to 172,141 (9,550 duplicate copies dropped in total), ensuring each incident is represented exactly once and that analytical results are not distorted by repeated cases.
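The two figures measure different things: the duplicate check counts distinct records that occur more than once, while deduplication drops every extra copy, so the row count can fall by more than the duplicate-key count. A pure-Python sketch of the same hash-and-count logic (duplicate_stats is an illustrative helper, not notebook code):

```python
from collections import Counter

def duplicate_stats(rows):
    """rows: hashable records (e.g. tuples of column values).
    Returns (number of distinct records occurring more than once,
             row count after deduplication)."""
    counts = Counter(rows)
    duplicated_keys = sum(1 for n in counts.values() if n > 1)
    return duplicated_keys, len(counts)
```

In the example below, 2 distinct records are duplicated, yet deduplication removes 3 rows, mirroring how 3,170 duplicated records in the GTD account for 9,550 dropped copies.

```python
rows = [('a', 1), ('a', 1), ('b', 2), ('b', 2), ('b', 2), ('c', 3)]
# 6 rows in, 2 duplicated records, 3 rows after dedup
```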
# -----------------------------
# 1. Count duplicate rows
# -----------------------------
# Create a hash column of all columns to identify duplicates
# (this counts distinct records appearing more than once; note that
# concat_ws skips nulls, so rows differing only in nulls can collide)
from pyspark.sql.functions import concat_ws
gtd_df_with_hash = gtd_df.withColumn("row_hash", concat_ws("_", *gtd_df.columns))
duplicate_count = gtd_df_with_hash.groupBy("row_hash").count().filter(F.col("count") > 1).count()
print("Number of duplicate rows:", duplicate_count)
Number of duplicate rows: 3170
# -----------------------------
# 2. Display duplicate rows (optional)
# -----------------------------
duplicate_rows_df = gtd_df_with_hash.groupBy(gtd_df.columns) \
    .count() \
    .filter(F.col("count") > 1) \
    .drop("count")
duplicate_rows_df.show(truncate=False)
+----+-----+---+---------------------------+--------------+---------------------+----------+----------+-------+------------------------------+---------------------------+-------------------------+-----------+------------------------------------------------+-------+------+-------+-------+--------+----------+----------+ |year|month|day|region |country |province |latitude |longitude |success|attack_type |target_type |target |weapon_type|terror_group |suicide|killed|wounded|summary|dbsource|date |casualties| +----+-----+---+---------------------------+--------------+---------------------+----------+----------+-------+------------------------------+---------------------------+-------------------------+-----------+------------------------------------------------+-------+------+-------+-------+--------+----------+----------+ |1979|1 |6 |Western Europe |Italy |Lazio |41.890961 |12.490069 |1 |Facility/Infrastructure Attack|Business |Movie theater |Incendiary |Unknown |0 |0.0 |0.0 |Unknown|PGIS |1979-01-06|0 | |1986|2 |21 |Central America & Caribbean|El Salvador |Cuscatlan |13.682638 |-88.926466|1 |Bombing/Explosion |Utilities |electrical line post |Explosives |Farabundo Marti National Liberation Front (FMLN)|0 |0.0 |0.0 |Unknown|PGIS |1986-02-21|0 | |1987|4 |9 |South America |Peru |Lima |-11.967368|-76.978462|0 |Bombing/Explosion |Private Citizens & Property|Street |Explosives |Shining Path (SL) |0 |0.0 |0.0 |Unknown|PGIS |1987-04-09|0 | |1989|5 |17 |Central America & Caribbean|El Salvador |Cabanas |13.864829 |-88.7494 |1 |Bombing/Explosion |Utilities |115,000 Volt Power Line |Explosives |Farabundo Marti National Liberation Front (FMLN)|0 |0.0 |0.0 |Unknown|PGIS |1989-05-17|0 | |1991|4 |5 |South America |Peru |Lima |-11.975814|-76.7699 |1 |Bombing/Explosion |Utilities |High Tension Power Lines |Explosives |Shining Path (SL) |0 |0.0 |0.0 |Unknown|PGIS |1991-04-05|0 | |1991|5 |27 |Central America & Caribbean|El Salvador |Usulutan |13.516667 |-88.383333|1 |Bombing/Explosion 
(truncated `show(truncate=False)` output: top 20 rows of the 21-column GTD DataFrame, covering year, month, day, region, country, state, latitude, longitude, attack type, target type and subtype, weapon type, perpetrator group, casualty counts, and incident date; "only showing top 20 rows")
# -----------------------------
# 3. Remove duplicate rows
# -----------------------------
gtd_df = gtd_df.dropDuplicates()
# -----------------------------
# 4. Check new shape
# -----------------------------
num_rows = gtd_df.count()
num_cols = len(gtd_df.columns)
print(f"Shape after removing duplicates: ({num_rows}, {num_cols})")
Shape after removing duplicates: (172141, 21)
Outliers in numerical features like killed, wounded, and casualties can disproportionately influence statistical analyses and model performance. To mitigate their impact, the Interquartile Range (IQR) method was applied. The first (Q1) and third quartiles (Q3) were computed for each variable to calculate the IQR, and upper thresholds were set at Q3 + 1.5*IQR. Rather than removing the identified outliers, a capping approach was used, replacing extreme values beyond the upper bound with the threshold value. This strategy preserved the data structure and sample size while limiting the influence of anomalously high values, particularly in terrorism-related incidents where some cases may report exceptionally large numbers of casualties.
# Summary statistics for selected numeric columns
gtd_df.select("killed", "wounded", "casualties").describe().show()
+-------+------------------+-----------------+------------------+
|summary|            killed|          wounded|        casualties|
+-------+------------------+-----------------+------------------+
|  count|            172141|           172141|            171927|
|   mean|2.3614827379880445|3.008870635118885|5.3763981224589505|
| stddev|11.496217161927367|35.21393062470437|41.637251045678035|
|    min|              -9.0|              0.0|                -4|
|    max|            1570.0|           8191.0|              9574|
+-------+------------------+-----------------+------------------+
# Select only killed, wounded, casualties
gtd_df_pd = gtd_df.select("killed", "wounded", "casualties").toPandas()
# Melt for Plotly
melted_df = gtd_df_pd.melt(var_name="Metric", value_name="Count")
# Box plot
fig = px.box(
    melted_df,
    x="Metric",
    y="Count",
    title="Distribution of Killed, Wounded, and Casualties",
    color="Metric",
    points="outliers"  # show individual outlier points
)
fig.show()